Every research project aims to answer a research question (or multiple questions).
Do ECU students who exercise regularly have a higher GPA?
Each research question aims to examine a population.
The population for this research question is ECU students.
It is usually impractical to study the whole population related to a research question.
A sample (of size \(n\)) is a subset of the population (of size \(N\)).
The Goal: Select a representative sample to generalize to the broader population.
What is representative?
Data quality matters more than data quantity.
Many anthropological studies (or similar) are convenience-based.
Simple random sample: every member of a population has an equal chance of being selected.
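A minimal sketch of a simple random sample in R, assuming a hypothetical population of 1,000 numbered members and a desired sample of 100 (both values are illustrative, not taken from a real study). Calling `sample()` with `replace = FALSE` gives every member the same chance of selection and never picks anyone twice.

```r
set.seed(123)  # optional: for reproducible results

# Hypothetical population: members identified by the numbers 1 to 1000
population <- 1:1000

# Draw 100 members without replacement; each member is equally likely
srs <- sample(population, size = 100, replace = FALSE)

length(srs)         # number of members drawn
anyDuplicated(srs)  # 0 means no member was drawn twice
```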
To Generalize:
Systematic sample: similar to a simple random sample, BUT elements are chosen at regular intervals.
# 1. Create a population (e.g., a vector of 1 to 1000)
population <- 1:1000
# 2. Define the desired sample size
sample_size <- 100
# 3. Calculate the sampling interval (k)
N <- length(population) # Population size
k <- N / sample_size
# If k is not an integer, you might use ceiling(N/n) and adjust the logic
# 4. Choose a random starting point (r) between 1 and k
set.seed(123) # Optional: for reproducible results
start_point <- sample(1:k, 1)
# 5. Select every k-th element starting from the random start point
systematic_sample_indices <- seq(from = start_point, to = N, by = k)
systematic_sample <- population[systematic_sample_indices]
# 6. View the first few elements and the dimension of the sample
head(systematic_sample)
[1] 3 13 23 33 43 53
length(systematic_sample)
[1] 100
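When the interval k is not an integer, one common workaround is to keep k fractional and round each position up. The sketch below assumes a hypothetical sample size of 130, which does not divide 1000 evenly; the name `idx` and the use of `runif()` for the starting point are illustrative choices, not part of the example above.

```r
set.seed(123)  # optional: for reproducible results

population <- 1:1000
sample_size <- 130               # hypothetical size that does not divide N evenly
N <- length(population)
k <- N / sample_size             # fractional sampling interval (about 7.69)

# Random starting point strictly between 0 and k
start_point <- runif(1, min = 0, max = k)

# Step through the population in increments of k and round each position up
idx <- ceiling(start_point + k * (0:(sample_size - 1)))
systematic_sample <- population[idx]

length(systematic_sample)  # exactly sample_size elements
range(idx)                 # positions stay between 1 and N
```

Because k is at least 1, consecutive positions always differ, so no element is selected twice.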
set.seed(123)
population <- data.frame(
Supermarket = paste("Supermarket", 1:1000, sep = "_"),
CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)
selected_supermarkets <- sample(population$Supermarket, size = 10, replace = FALSE)
sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]
head(sampled_data)
        Supermarket CustomerSatisfaction
203 Supermarket_203 72.34855
225 Supermarket_225 71.36343
255 Supermarket_255 90.98509
354 Supermarket_354 76.16637
457 Supermarket_457 86.10277
554 Supermarket_554 77.49825
set.seed(123)
region <- data.frame(
Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)
households <- data.frame(
Neighborhood = rep(sample(region$Neighborhood, size = 500, replace = TRUE), each = 20),
HouseholdID = rep(1:20, times = 500),
EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)
selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)
sampled_households <- households[households$Neighborhood %in% selected_neighborhoods, ]
head(sampled_households)
     Neighborhood HouseholdID EmploymentStatus
1981 Neighborhood_302 1 Unemployed
1982 Neighborhood_302 2 Employed
1983 Neighborhood_302 3 Employed
1984 Neighborhood_302 4 Employed
1985 Neighborhood_302 5 Unemployed
1986 Neighborhood_302 6 Unemployed
set.seed(123)
states <- data.frame(
State = paste("State", 1:50, sep = "_"),
Population = sample(1000000:5000000, 50, replace = TRUE)
)
counties <- data.frame(
State = rep(sample(states$State, size = 50, replace = TRUE), each = 20),
County = rep(paste("County", 1:20, sep = "_"), times = 50),
VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)
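# Stage 1: randomly select 3 states
selected_states <- sample(states$State, size = 3, replace = FALSE)
# Stage 2: randomly select 5 county names from the selected states
selected_counties <- sample(counties$County[counties$State %in% selected_states], size = 5, replace = FALSE)
# Note: county names (County_1 to County_20) repeat across states, so this filter
# keeps every row with a selected county name, including rows from states that were not selected
sampled_vaccination_centers <- counties[counties$County %in% selected_counties, ]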
head(sampled_vaccination_centers)
      State    County VaccinationRate
8 State_32 County_8 70.37428
11 State_32 County_11 66.86024
13 State_32 County_13 70.81309
15 State_32 County_15 67.68222
19 State_32 County_19 70.91839
28 State_46 County_8 68.84869
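County names such as County_8 appear in more than one state in the output above. A minimal sketch of one way to keep the second stage inside the selected states is to sample row positions instead of county names. It assumes a simplified, hypothetical frame in which each of the 50 states has exactly 20 counties; `eligible_rows` and `selected_rows` are illustrative names.

```r
set.seed(123)  # optional: for reproducible results

# Hypothetical frame: 50 states with 20 counties each (simulated rates)
counties <- data.frame(
  State = rep(paste("State", 1:50, sep = "_"), each = 20),
  County = rep(paste("County", 1:20, sep = "_"), times = 50),
  VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)

# Stage 1: randomly select 3 states
selected_states <- sample(unique(counties$State), size = 3, replace = FALSE)

# Stage 2: randomly select 5 counties, by row position, within those states only
eligible_rows <- which(counties$State %in% selected_states)
selected_rows <- sample(eligible_rows, size = 5, replace = FALSE)
sampled_counties <- counties[selected_rows, ]

table(sampled_counties$State)  # every sampled county belongs to a selected state
```

Sampling row positions identifies each county by its state and name together, so a county name shared by several states cannot pull in rows from states that were not selected.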
How do we infer future events or population characteristics?
In a random process there is more than one possible outcome.
Sample space: the set of all possible outcomes of a random process.
An event is a subset of the sample space.
Examples with a 6-sided die:
Let A represent the event that a single roll of the die results in an even number.
A = {2, 4, 6}
Let B represent the event that a single roll of the die results in an odd number.
B = {1, 3, 5}
Let C represent the event that a single roll of the die results in a prime number.
C = {2, 3, 5}
Complement: the set of all outcomes in the sample space that are not in the event itself.
Example:
Let C represent the event that a single roll of the die results in a prime number.
C = {2, 3, 5}, so the complement of C is {1, 4, 6}.
Mutually exclusive events, using A = {2, 4, 6}, B = {1, 3, 5}, and C = {2, 3, 5}:
Events \(A\) and \(B\) are mutually exclusive because an outcome cannot be both even and odd.
Events \(A\) and \(C\) are not mutually exclusive because the outcome 2 is both even and prime.
| Description | Notation | Reading | Elements |
|---|---|---|---|
| Union | \(A \cup C\) | A or C | {2, 3, 4, 5, 6} |
| Intersection | \(A \cap C\) | A and C | {2} |
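The die events can be checked with base R's set functions. This is a small sketch in which `S` is an assumed name for the sample space of a 6-sided die; `union()`, `intersect()`, and `setdiff()` treat vectors as sets.

```r
S <- 1:6           # sample space of a single roll (assumed name)
A <- c(2, 4, 6)    # even number
B <- c(1, 3, 5)    # odd number
C <- c(2, 3, 5)    # prime number

sort(union(A, C))   # A or C: 2 3 4 5 6
intersect(A, C)     # A and C: 2
setdiff(S, C)       # complement of C: 1 4 6

# Two events are mutually exclusive when they share no outcomes
length(intersect(A, B)) == 0   # TRUE: A and B are mutually exclusive
length(intersect(A, C)) == 0   # FALSE: the outcome 2 is in both A and C
```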
library(ggplot2) # needed for ggplot(), geom_histogram(), and facet_grid()

set.seed(1)
OBV <- 1:10 # values to sample from
sizes <- c(1, 9, 16, 25, 36) # sample sizes to compare
reps <- 100 # number of sample means recorded per size

# For each sample size, draw `reps` samples with replacement and record each sample mean
sample_means <- lapply(sizes, function(n) {
  replicate(reps, mean(sample(OBV, n, replace = TRUE)))
})

Dist.df <- data.frame(
  Size = factor(rep(sizes, each = reps)),
  Sample_Means = unlist(sample_means)
)

# One histogram panel per sample size
ggplot(Dist.df, aes(Sample_Means, fill = Size)) +
  geom_histogram(bins = 30) +
  facet_grid(. ~ Size)
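The histograms narrow as the sample size grows. A minimal numerical sketch of the same pattern compares the standard deviation of the simulated sample means with the value \(\sigma / \sqrt{n}\) expected for sampling with replacement. It assumes 2,000 replications per size (more than the 100 used for the plot, to stabilize the estimates); `sigma`, `observed_sd`, and `expected_sd` are illustrative names.

```r
set.seed(1)
OBV <- 1:10
sizes <- c(1, 9, 16, 25, 36)
reps <- 2000  # assumption: more replications than the plot uses, for stabler estimates

# Population standard deviation of the values 1 to 10 (divides by N, not N - 1)
sigma <- sqrt(mean((OBV - mean(OBV))^2))

# Standard deviation of the simulated sample means at each sample size
observed_sd <- sapply(sizes, function(n) {
  sd(replicate(reps, mean(sample(OBV, n, replace = TRUE))))
})

# Value expected for the standard deviation of the sample mean
expected_sd <- sigma / sqrt(sizes)

data.frame(
  Size = sizes,
  Observed_SD = round(observed_sd, 3),
  Expected_SD = round(expected_sd, 3)
)
```

The two columns should agree closely, and both shrink as the sample size increases.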